Analysis of single-plex TMT experiments

MaxQuant

Phil Wilmarth, OHSU PSR Core, April 2019

Overview and objectives

Most ways that isobaric labeling data are analyzed hinder statistical testing and data visualization. This notebook will analyze data from (Ref-1) to demonstrate a two-condition (3 controls and 4 treatments) comparison done with 10-plex TMT reagents using the SPS MS3 method (Ref-2) on a Thermo Fusion instrument. The notebook will show how to:

  • load in prepped results table and check the data
  • normalize the data and visualize the result
  • compare the two conditions using edgeR
  • visualize the differential expression candidates
  • export the statistical test results in a useful format
  • compare two-sample t-test to edgeR

Experiment background and MaxQuant data processing

The KUR1502 data were mouse bone marrow cell cultures where the "media" samples are controls (one of the 4 did not label) and there are leukemia exosome-dosed cells (4 "exosome" samples). 10-plex TMT was used on a Thermo Fusion using the SPS MS3 method. The data were processed using MaxQuant (Ref-3), version 1.6.5.0.

Most parameters were kept at default values. The type of experiment was "Reporter ion MS3" and the reporter mass tolerance was kept at the default of 0.003 Da. MaxQuant has precursor ion purity filters for MS2 reporter ions but not for MS3 reporter ion experiments. The FASTA protein database was the same one as used in the Comet/PAW processing.

It is not clear exactly what "reporter ion intensities" are in MaxQuant. Most quantities called "intensities" seem to be peak heights in MaxQuant. There are a couple of other significant differences in how MaxQuant behaves compared to the PAW pipeline. MaxQuant uses a protein ranking function and decoy protein matches to implement an ad hoc protein FDR filter. Single peptide per protein identifications are allowed. How correct and incorrect PSMs relate to score thresholds and how correct and incorrect proteins meet evidence requirements for reporting are very different processes. As such, a simplistic application of the target/decoy method from PSM identification is not really valid for proteins. The PAW pipeline tracks decoy protein identifications for protein FDR estimates, but does not do any explicit error contrl at the protein level. If a resulting protein FDR is deemed too high, the remedy is to filter PSMs more strictly to reduce the number of incorrect peptide sequences. The PAW pipeline uses a minimum protein reporting criterion of two peptides per protein.

Treatment of shared peptides is also different. MaxQuant uses an all-or-nothing razor peptide method. For sets of proteins linked by shared peptides, one protein (or group) will have the strongest evidence (what about ties?) and gets all of the shared peptides. The shared peptides are excluded from the proteins with less evidence. The PAW pipeline instead groups together proteins that have mostly shared peptides in common. The combined protein "family" gets all of the peptides from the individual family members. Many housekeeping genes can have almost all peptides being shared with sparse unique peptide evidence (actins, tubulins, etc.). In the razor peptide approach, one family member can get assigned extensive peptide information. The other family members only get the sparse unique evidence. These sparse family members can be problematic with unreliable data and appearing to be at extremely different relative expression levels compared to other family members. The magnitude of these issues depends on underlying genomic complexity, choice of protein database, and sample composition. Grouping of homologous proteins is a more reliable and stable approach.

1. Huan, J., Hornick, N.I., Goloviznina, N.A., Kamimae-Lanning, A.N., David, L.L., Wilmarth, P.A., Mori, T., Chevillet, J.R., Narla, A., Roberts Jr, C.T. and Loriaux, M.M., 2015. Coordinate regulation of residual bone marrow function by paracrine trafficking of AML exosomes. Leukemia, 29(12), p.2285.

2. McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.

3. Cox, J. and Mann, M., 2008. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature biotechnology, 26(12), p.1367.


Load libraries and read in the data file

In [1]:
# load libraries
library("tidyverse")
library("psych")
library("gridExtra")
library("scales")
library("limma") 
library("edgeR") 

# read the grouped protein summary file
MQ_raw <- read_tsv("KUR1502_data-export.txt")
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Attaching package: ‘gridExtra’

The following object is masked from ‘package:dplyr’:

    combine


Attaching package: ‘scales’

The following objects are masked from ‘package:psych’:

    alpha, rescale

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Parsed with column specification:
cols(
  Accession = col_character(),
  Media_2.1 = col_double(),
  Media_2.2 = col_double(),
  Media_3.2 = col_double(),
  Exo_2.1 = col_double(),
  Exo_2.2 = col_double(),
  Exo_3.1 = col_double(),
  Exo_3.2 = col_double()
)

Extract the data table

In [2]:
# separate accessions from the data
accession <- MQ_raw$Accession
MQ_tmt <- MQ_raw %>% select(-Accession)

head(MQ_tmt)
nrow(MQ_tmt)
Media_2.1Media_2.2Media_3.2Exo_2.1Exo_2.2Exo_3.1Exo_3.2
108150012454001069200990220 848080 12809001568400
730260 907010 743560759690 624090 8968101048800
687340 856020 685910554000 435720 756950 863480
571540 713260 598050610810 474850 755830 850350
624970 768790 563760588560 500550 670370 722580
8720201093500 672060450790 324490 410460 472880
4878

Check zero replacement situation

During data preparation in Excel, a minimum sum of reporter ion intensity of 1000 was used. Missing data is mostly associated with low abundance, so a low abundance cutoff can remove most missing data. A replacement value of 20 was used after filtering. The smallest non-zero data points were around 50. We can see how many proteins had different numbers of zero replacements.

In [3]:
counts <- 1:nrow(MQ_tmt) # create vector of appropriate length
for(i in 1:nrow(MQ_tmt)){
    # TRUE will be coerced to 1 for the summing
    counts[i] <- sum(MQ_tmt[i, ] == 20)
}
table(counts) # create the summary
counts
   0    1    2    3    4    5 
4746   79   28   19    5    1 

We have more missing data with MaxQuant

The vast majority of the protein have no missing values or just one. There are 53 proteins that more than one missing value. We can keep an eye out for those in the final results sheet after the analysis. We really only need to worry about them if they get flagged as significant differential expression (DE) candidates.


What do the unnormalized data distributions look like?

We will use box plots, a common way to summarize distributions, and distribution density plots to visualize the Log base 10 of the intensities. The 3 media samples are in red color and the 4 exosome-dosed samples are in blue color.

In [4]:
# let's see what the starting data look like
color = c(rep("red", 3), rep("blue", 4))
boxplot(log10(MQ_tmt), col = color, notch = TRUE, main = "Starting MaxQuant data")

Do we need to do some data normalization?

The boxplots are not very well horizontally aligned. A global scaling seems in order. Each channel is supposed to be the same total amount of protein digest that was labeled. We would, therefore, expect the total sum of reporter ion intensities to be the same in each channel.

NOTE: The MaxQuant reporter ion intensity values differ in magnitude from PAW (or Proteome Discoverer) by a significant factor (about a factor of 8 smaller).

We can make a function that takes the column sums, computes the average sum, and scales each column so that its new sum will be the average.

In [5]:
SL_Norm <- function(df, color = NULL, plot = TRUE) {
    # This makes each channel sum to the average grand total
        # df - data frame of TMT intensities
        # returns a new data frame with normalized values
    
    # compute scaling factors to make colsums match the average sum
    norm_facs <- mean(c(colSums(df))) / colSums(df)
    cat("SL Factors:\n", sprintf("%-5s -> %f\n", colnames(df), norm_facs))
    df_sl  <- sweep(df, 2, norm_facs, FUN = "*")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(df_sl), col = color, notch = TRUE, main = "SL Normalized data")
    }
    df_sl
}

# SL norm the tmt data
MQ_tmt_sl <- SL_Norm(MQ_tmt, color)
SL Factors:
 Media_2.1 -> 0.877458
 Media_2.2 -> 0.712193
 Media_3.2 -> 0.997123
 Exo_2.1 -> 1.216413
 Exo_2.2 -> 1.494475
 Exo_3.1 -> 1.089059
 Exo_3.2 -> 0.957944

We can check the column sums before and after normalization

The normalization brought the boxplots into horizontal alignment. We can double check the computations by seeing if the column sums are equal after normalization.

In [6]:
print("Before:")
format(round(colSums(MQ_tmt), digits = 0), big.mark = ",")
print("After:")
format(round(colSums(MQ_tmt_sl), digits = 0), big.mark = ",")
[1] "Before:"
Media_2.1
'110,182,550'
Media_2.2
'135,750,485'
Media_3.2
' 96,959,547'
Exo_2.1
' 79,480,035'
Exo_2.2
' 64,691,993'
Exo_3.1
' 88,774,401'
Exo_3.2
'100,925,042'
[1] "After:"
Media_2.1
'96,680,579'
Media_2.2
'96,680,579'
Media_3.2
'96,680,579'
Exo_2.1
'96,680,579'
Exo_2.2
'96,680,579'
Exo_3.1
'96,680,579'
Exo_3.2
'96,680,579'

Data was loaded, TMT data extracted, and normalization checked

The basic starting steps in any analysis are to make sure the data was read in correctly. Proteomics summary files usually have lots of information in addition to the quantitative data. Some work to extract the correct quantitative columns will be needed (this is often called the data wrangling part of an analysis). This can be done in Excel before reading into R (possible source of errors) or can be done in R (a different set of possible errors). Filtering of rows that should be excluded from the statistical testing is also necessary. Those include common contaminants, any decoy proteins, and proteins that might have too much missing data to be reliable. Some care to make sure that the testing results can be merged back with the proteomics results has to be used.

The extracted data may need some further formatting depending on the downstream statistical testing requirements. Here, we are saving the protein accession information as a vector and the actual TMT reporter ions as a 2D table of numbers.

Some exploratory statistics on the starting data are always a good idea. R has many functions to do far more than we have done here. We did some very basic checks to see what the data looked like before and after one simple normalization method. The basic features of the data look okay, so we will move on to the statistical testing.


A brief history of quantitative proteomics

In the dark ages before proteomics, mass spectrometers were made of stone (sector magnets and chemical ionization) and the t-test was developed to test beer (an essential ingredient of proteomics from the very beginning). Then came the not-so-golden ratio age (Penning ion traps, Q-TOFs, ICAT, and SILAC) where quantitative proteomics became synonymous with ratios and fold-changes. Battle lines were drawn between complicated standard formats with lengthy command line processing pipelines and the enduring allure of mysterious "black box" commercial products.

Ironically, progress on the instrument side seems to have accelerated while advances on the processing side seems to be in an ice age glacial state. We have seen an increase in the number of standard formats and their complexity. They seemed to have found the sweet spot of being quite hard for both humans and computers to read. We now have shiny "open source" black boxes to compete with more faded "commercial" black boxes. Statistical testing has evolved from one ratio (and no need for statistics) to boatloads of ratios as multiplexing capacity has increased. This leads to a literal "log2" jam that might involve beer consumption because it always ends up with a t-test.

Can we actually do quantitative proteomics without going down this ancient, overgrown path to failure? The answer is yes. Genomics had a head start on proteomics and has been doing the larger sample number, complicated experimental designs that biological studies require for many years. Modern instruments and methods are producing quantitative proteomics datasets that are much more similar to genomics data than they were in the past. If we do a little research (are we not scientists?) into genomic tools and data formats, we can learn how to prepare proteomics data to leverage these more mature statistical packages and stop trying to invent square wheels and making sacrifices to the Volcano plot Gods.

Joking aside, proteomics needs to recognize the similarity between proteomics and genomics data and abandon outdated thinking that was never all that useful. Computers and data science methods have changed just as much as new instruments like the timsTOF have changed compared to a sector magnet mass specs.


Differential expression (DE) testing

This is not the place for a review of hypothesis testing. It is, however, a good place to get the gist of statistical testing in this context and define some concepts for the rest of the notebook. There are many types of proteomics experiments, and they fit into many analysis pigeon holes. A common scenario is testing if a protein is expressed at the same or different levels in a background of many other proteins.

Complex protein mixtures from two or more biological conditions (such as control and treatment, or normal and disease) are compared by making measurements of relative abundances. There are some hidden constraints in these experimental designs. The total amount of protein is usually fixed at the same value across all samples (an individual biological replicate) and there is a dynamic range within the proteomes. High abundance proteins are constrained by the total protein amount and do not have the same freedom for abundance differences that proteins that are smaller fractions of the total have. Higher abundance proteins must, by experimental design, have moderated differences in means and reduced variances. Lower abundance proteins have more freedom and will not really appear to be constrained.

The constrained nature of these experimental designs (both genomics and proteomics) also has implications in normalization algorithms. An increased expression in abundant proteins will "push" all other proteins down (their abundance total has to remain the same). Grand total normalization like we use above can be too simplistic. The Bioconductor (Ref-4) differential expression testing package edgeR (Ref-5) includes a trimmed mean of M values (TMM) normalization step (Ref-6) designed for these types of samples.

4. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

5. Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

6. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.


The next parts of the notebook will demonstrate using edgeR to do DE testing:

  • proteomics data can be very similar to genomics data
  • robust genomics statistical packages can be used
  • statistical models and test results can be checked with visualizations

Load data into edgeR data structures

Bioconductor uses its own data classes. The main data table is organized by each gene or protein in a separate row, and each biological sample in a separate column. Genes/proteins can have rich annotations and the samples can have metadata. Think of this like a table of information about the rows and a table of information about the columns. These are linked tables. If you change the data rows, you have to update the row table, etc.

EdgeR has extensive documentation and you can read more about its data classes in the user's guide. We need to get the reporter ion intensities into a DGEList object. In addition to the data table, we can add the protein accessions as row labels, and the group membership (plain media or exosome dosed) information. We will use variable names similar to what is in the user's guide to make learning more about edgeR easier.

In [7]:
# load data into DGEList object
group <- c(rep("media", 3), rep("exosome", 4))
y <- DGEList(counts = MQ_tmt, group = group, genes = accession)

# we can see what y looks like
y
$counts
Media_2.1Media_2.2Media_3.2Exo_2.1Exo_2.2Exo_3.1Exo_3.2
108150012454001069200990220 848080 12809001568400
730260 907010 743560759690 624090 8968101048800
687340 856020 685910554000 435720 756950 863480
571540 713260 598050610810 474850 755830 850350
624970 768790 563760588560 500550 670370 722580
8720201093500 672060450790 324490 410460 472880
704940 783680 540030558110 485830 587730 612800
430460 543770 517050461710 392110 633940 695120
519420 578450 525680406460 320940 550170 611990
636140 822030 501890392590 299830 362810 448660
472090 562870 297500494140 483680 600650 547300
516570 614480 433580407450 366480 475560 496270
474830 608440 331810369300 351140 461060 509860
437520 540420 437910379570 305600 451300 524380
469610 570420 428130396210 319790 389490 442840
379090 457450 397190381120 301150 465690 542970
365250 461620 376000338670 294950 476220 583130
402010 417750 397230361670 285320 491930 502510
327830 424280 443310362900 276560 413030 535790
593000 709610 433040275630 195010 240540 278840
438190 586350 371520329560 259640 294930 345840
545450 696040 403710266880 196470 242400 279730
440720 494100 417970294300 210540 326270 379050
430290 531220 357490311320 249090 286540 330660
349530 365750 557740179290 145010 373120 416430
375360 465440 361840275980 225900 283520 340750
285470 327040 313720302250 256540 410720 443440
351750 443590 279050265150 217580 298150 340190
23068 29475 71205 26553 35673 1134700 843890
330190 388010 319470256760 212050 277420 313850
⋮⋮⋮⋮⋮⋮⋮
218446 20142 20185159
207101193 83256 60247
20163 66 20 20710206
121237184 77206215103
201469199107 20 20163
272307205117 90 82 60
261237251107114 91 73
183 20 20133 64 20754
155369283169 20 20155
91507261111 20144 20
20153 20260108371223
333310182 72 20 66145
135265116126107213142
164375137 20 90157180
261156274 66 58114159
214463169 20 20155 86
223625142 93 20 20 20
115192174160159102180
150488145 87144 68 20
241411266 20 20 20145
160113 20111139122331
64 72 49272256186159
144228213 89 76127170
263221188194105 20 75
492472 74 20 20 20 20
236253136 20 20 70340
278321 80117 48112 75
113187451119 20 20143
272227 20198 87 20229
110 79110235142221113
$samples
grouplib.sizenorm.factors
Media_2.1media 1101825501
Media_2.2media 1357504851
Media_3.2media 969595471
Exo_2.1exosome 794800351
Exo_2.2exosome 646919931
Exo_3.1exosome 887744011
Exo_3.2exosome 1009250421
$genes
genes
sp|P60710|ACTB_MOUSE
sp|Q8VDD5|MYH9_MOUSE
sp|P17182|ENOA_MOUSE
sp|P52480|KPYM_MOUSE
sp|Q8BTM8|FLNA_MOUSE
sp|P63017|HSP7C_MOUSE
sp|P62806|H4_MOUSE
sp|P16858|G3P_MOUSE
sp|P09411|PGK1_MOUSE
sp|P10126|EF1A1_MOUSE
sp|P31725|S10A9_MOUSE
sp|Q8CGP2|H2B1P_MOUSE
sp|P11247|PERM_MOUSE
sp|P26039|TLN1_MOUSE
sp|P20029|BIP_MOUSE
sp|P09103|PDIA1_MOUSE
sp|P05064|ALDOA_MOUSE
sp|Q61233|PLSL_MOUSE
sp|P20152|VIME_MOUSE
sp|P58252|EF2_MOUSE
sp|P68372|TBB4B_MOUSE
sp|P07901|HS90A_MOUSE
sp|P06151|LDHA_MOUSE
sp|P68373|TBA1C_MOUSE
sp|Q61207|SAP_MOUSE
sp|P08113|ENPL_MOUSE
sp|Q9JKF1|IQGA1_MOUSE
sp|P40142|TKT_MOUSE
sp|Q5SX39|MYH4_MOUSE
sp|Q68FD5|CLH1_MOUSE
â‹®
sp|Q91YP0|L2HDH_MOUSE
sp|Q8K0Q5|RHG18_MOUSE
sp|Q00560|IL6RB_MOUSE
sp|Q91X96|MSS4_MOUSE
sp|O70338|RNH1_MOUSE
sp|Q8VE65|TAF12_MOUSE
sp|Q8K2V6|IPO11_MOUSE
sp|Q75VT8|HIDE1_MOUSE
sp|O08901|BUB1_MOUSE
sp|Q8CI61|BAG4_MOUSE
sp|P01590|IL2RA_MOUSE
sp|Q6PGF3|MED16_MOUSE
sp|Q8BHI7|ELOV5_MOUSE
sp|Q8BPM2|M4K5_MOUSE
sp|Q501J7|PHAR4_MOUSE
sp|Q9EPU5|TNR21_MOUSE
sp|Q3TUU5|TEX30_MOUSE
sp|P58501|PAXB1_MOUSE
sp|Q91W18|TDRD3_MOUSE
tr|B7ZNL9|B7ZNL9_MOUSE
sp|P48725|PCNT_MOUSE
sp|P58749|TM6S1_MOUSE
sp|Q8CIA9|MF14B_MOUSE
sp|P53995|APC1_MOUSE
sp|Q80WE4|KI20B_MOUSE
sp|Q8C0L9|GPCP1_MOUSE
sp|Q9CPW7|ZMAT2_MOUSE
sp|Q8BWG9|CRCM1_MOUSE
sp|Q99JP6|HOME3_MOUSE
sp|Q9D1I2|CAR19_MOUSE

We loaded the unnormalized TMT data

EdgeR normalization is actually done in two steps. The first, called a library size adjustment, is like the SL normalization we did above. This gets the rid of the big differences between samples so that the TMM algorithm has better starting data.

In [8]:
# run the TMM normalization
y <- calcNormFactors(y)

# check what changed
y
$counts
Media_2.1Media_2.2Media_3.2Exo_2.1Exo_2.2Exo_3.1Exo_3.2
108150012454001069200990220 848080 12809001568400
730260 907010 743560759690 624090 8968101048800
687340 856020 685910554000 435720 756950 863480
571540 713260 598050610810 474850 755830 850350
624970 768790 563760588560 500550 670370 722580
8720201093500 672060450790 324490 410460 472880
704940 783680 540030558110 485830 587730 612800
430460 543770 517050461710 392110 633940 695120
519420 578450 525680406460 320940 550170 611990
636140 822030 501890392590 299830 362810 448660
472090 562870 297500494140 483680 600650 547300
516570 614480 433580407450 366480 475560 496270
474830 608440 331810369300 351140 461060 509860
437520 540420 437910379570 305600 451300 524380
469610 570420 428130396210 319790 389490 442840
379090 457450 397190381120 301150 465690 542970
365250 461620 376000338670 294950 476220 583130
402010 417750 397230361670 285320 491930 502510
327830 424280 443310362900 276560 413030 535790
593000 709610 433040275630 195010 240540 278840
438190 586350 371520329560 259640 294930 345840
545450 696040 403710266880 196470 242400 279730
440720 494100 417970294300 210540 326270 379050
430290 531220 357490311320 249090 286540 330660
349530 365750 557740179290 145010 373120 416430
375360 465440 361840275980 225900 283520 340750
285470 327040 313720302250 256540 410720 443440
351750 443590 279050265150 217580 298150 340190
23068 29475 71205 26553 35673 1134700 843890
330190 388010 319470256760 212050 277420 313850
⋮⋮⋮⋮⋮⋮⋮
218446 20142 20185159
207101193 83256 60247
20163 66 20 20710206
121237184 77206215103
201469199107 20 20163
272307205117 90 82 60
261237251107114 91 73
183 20 20133 64 20754
155369283169 20 20155
91507261111 20144 20
20153 20260108371223
333310182 72 20 66145
135265116126107213142
164375137 20 90157180
261156274 66 58114159
214463169 20 20155 86
223625142 93 20 20 20
115192174160159102180
150488145 87144 68 20
241411266 20 20 20145
160113 20111139122331
64 72 49272256186159
144228213 89 76127170
263221188194105 20 75
492472 74 20 20 20 20
236253136 20 20 70340
278321 80117 48112 75
113187451119 20 20143
272227 20198 87 20229
110 79110235142221113
$samples
grouplib.sizenorm.factors
Media_2.1media 1101825501.0418281
Media_2.2media 1357504851.0603080
Media_3.2media 969595471.0351963
Exo_2.1exosome 794800350.9848199
Exo_2.2exosome 646919930.9951479
Exo_3.1exosome 887744010.9376066
Exo_3.2exosome 1009250420.9516650
$genes
genes
sp|P60710|ACTB_MOUSE
sp|Q8VDD5|MYH9_MOUSE
sp|P17182|ENOA_MOUSE
sp|P52480|KPYM_MOUSE
sp|Q8BTM8|FLNA_MOUSE
sp|P63017|HSP7C_MOUSE
sp|P62806|H4_MOUSE
sp|P16858|G3P_MOUSE
sp|P09411|PGK1_MOUSE
sp|P10126|EF1A1_MOUSE
sp|P31725|S10A9_MOUSE
sp|Q8CGP2|H2B1P_MOUSE
sp|P11247|PERM_MOUSE
sp|P26039|TLN1_MOUSE
sp|P20029|BIP_MOUSE
sp|P09103|PDIA1_MOUSE
sp|P05064|ALDOA_MOUSE
sp|Q61233|PLSL_MOUSE
sp|P20152|VIME_MOUSE
sp|P58252|EF2_MOUSE
sp|P68372|TBB4B_MOUSE
sp|P07901|HS90A_MOUSE
sp|P06151|LDHA_MOUSE
sp|P68373|TBA1C_MOUSE
sp|Q61207|SAP_MOUSE
sp|P08113|ENPL_MOUSE
sp|Q9JKF1|IQGA1_MOUSE
sp|P40142|TKT_MOUSE
sp|Q5SX39|MYH4_MOUSE
sp|Q68FD5|CLH1_MOUSE
â‹®
sp|Q91YP0|L2HDH_MOUSE
sp|Q8K0Q5|RHG18_MOUSE
sp|Q00560|IL6RB_MOUSE
sp|Q91X96|MSS4_MOUSE
sp|O70338|RNH1_MOUSE
sp|Q8VE65|TAF12_MOUSE
sp|Q8K2V6|IPO11_MOUSE
sp|Q75VT8|HIDE1_MOUSE
sp|O08901|BUB1_MOUSE
sp|Q8CI61|BAG4_MOUSE
sp|P01590|IL2RA_MOUSE
sp|Q6PGF3|MED16_MOUSE
sp|Q8BHI7|ELOV5_MOUSE
sp|Q8BPM2|M4K5_MOUSE
sp|Q501J7|PHAR4_MOUSE
sp|Q9EPU5|TNR21_MOUSE
sp|Q3TUU5|TEX30_MOUSE
sp|P58501|PAXB1_MOUSE
sp|Q91W18|TDRD3_MOUSE
tr|B7ZNL9|B7ZNL9_MOUSE
sp|P48725|PCNT_MOUSE
sp|P58749|TM6S1_MOUSE
sp|Q8CIA9|MF14B_MOUSE
sp|P53995|APC1_MOUSE
sp|Q80WE4|KI20B_MOUSE
sp|Q8C0L9|GPCP1_MOUSE
sp|Q9CPW7|ZMAT2_MOUSE
sp|Q8BWG9|CRCM1_MOUSE
sp|Q99JP6|HOME3_MOUSE
sp|Q9D1I2|CAR19_MOUSE

edgeR computes correction factors

EdgeR does not change the counts table. The "samples" list slot does get new norm.factor values. EdgeR uses the normalization factors as offsets in its modeling. We will want to check the TMM step and visualize the data results, so we need to get a table of normalized data values. A more detailed discussion of edgeR normalization can be found here. We will use a function that returns a data frame of normalized data from a DGEList object.

In [9]:
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute grand total (library size) scalings
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size

    # the TMM factors are library adjustment factors (so divide by them)
    norm_facs <- lib_facs / y$samples$norm.factors
    cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n", 
                                                     colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# get the normalized data values
MQ_tmt_tmm <- apply_tmm_factors(y, color)
Overall Factors (lib.size+TMM):
 Media_2.1 -> 0.842229
 Media_2.2 -> 0.671685
 Media_3.2 -> 0.963221
 Exo_2.1 -> 1.235163
 Exo_2.2 -> 1.501762
 Exo_3.1 -> 1.161531
 Exo_3.2 -> 1.006598

Examine returned data and compute column totals

In [10]:
head(MQ_tmt_tmm)

print("After TMM:")
format(round(colSums(MQ_tmt_tmm), digits = 0), big.mark = ",")
Media_2.1_tmmMedia_2.2_tmmMedia_3.2_tmmExo_2.1_tmmExo_2.2_tmmExo_3.1_tmmExo_3.2_tmm
910871.0 836516.8 1029875.91223083.41273614.21487805.21578748.8
615046.4 609225.3 716212.6 938341.2 937234.61041672.71055720.3
578897.9 574976.0 660682.9 684280.5 654347.7 879221.0 869177.5
481367.8 479086.2 576054.3 754450.1 713111.6 877920.1 855960.9
526368.1 516384.9 543025.5 726967.7 751706.9 778655.6 727347.8
734440.8 734487.8 647342.3 556799.3 487306.7 476762.1 476000.2
[1] "After TMM:"
Media_2.1_tmm
' 92,798,975'
Media_2.2_tmm
' 91,181,601'
Media_3.2_tmm
' 93,393,473'
Exo_2.1_tmm
' 98,170,820'
Exo_2.2_tmm
' 97,151,971'
Exo_3.1_tmm
'103,114,230'
Exo_3.2_tmm
'101,590,979'

Column totals are not all the same

The point of TMM is that changes in the abundance of more abundant proteins distort the grand totals. That is the trimmed part of the algorithm. With freedom to change the abundances of the more abundant proteins, the column totals will no longer stay constrained. Even though the column totals are not the same, the boxplots are nicely aligned.

If we have large adjustment factors, it means we have some compositional differences between samples. That has to be reconciled with what is known about the system under study. Generally, some compositional differences between conditions could make sense. Compositional differences between samples within the same condition could mean contaminating tissue in a dissection, for example.

It is critical to remember that most normalization methods assume large numbers of proteins with unchanged expression are present in the samples. These methods are for samples in conditions that are mostly similar in their proteomes. When that is not the case, much more care has to go into picking a normalization method.

CV distributions are a good metric to test normalizations

Now that we have the table of TMM normalized data, we can do a few more QC type visualizations. We should have smaller CV values when normalization has worked well. If we do not get relatively tight CV distributions with nice median CV values, there are two likely reasons. There is significant inherent biological variability between samples (humans), or the sample preparation is too variable (need to optimize more).

In [11]:
# define variables for the columns in each condition
M <- 1:3
E <- 4:7
In [12]:
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# compare CV distributions
par(mfrow = c(2, 2))
labeled_boxplot(MQ_tmt[M], ylim = 150, title = "Starting Media CVs")
labeled_boxplot(MQ_tmt[E], ylim = 150, title = "Starting Exosome CVs")
labeled_boxplot(MQ_tmt_tmm[M], ylim = 75, title = "Media CVs after TMM")
labeled_boxplot(MQ_tmt_tmm[E], ylim = 75, title = "Exosome CVs after TMM")
par(mfrow = c(1, 1))

Check how the samples cluster

This is a two-condition comparison. One of the 4 media only samples did not label. The cultures were done in two batches ("2.x" and "3.x" in the labels). We expect to see the media only samples cluster together and be separated from a cluster of the exosome dosed samples.

In [13]:
# see how things cluster after we have normalized data
plotMDS(log2(MQ_tmt_tmm), col = c(rep("red", 3), rep("blue", 4)), main = "MaxQuant TMM")

Conditions separate left-to-right in first dimension

Media only clearly separates from exosome dosed in the leading dimension. These MDS plots are like principal component analysis. The x-axis is the major dimension of differences between samples, the y-axis is the second most important dimension of difference. The samples are separating left-to-right by condition. We probably also have a culture prep batch effect between batch 2 and 3. We should have more samples in each condition before attempting things like batch corrections, so we will focus on the differences between media and exosome conditions.

Compare samples to each other with scatter plots

Multi-panel scatter plot grids are another exploratory data analysis technique that is well supported in R. We can see how similar biological replicates are to each other within the median only condition or in the exosome dosed condition. We can also see if the samples between conditions seem different. We would expect to see larger differences (more scatter) between groups that within groups.

In [14]:
# scatter plots within groups and betwen groups
pairs.panels(log10(MQ_tmt_tmm), lm = TRUE, main = "After TMM")

Within condition samples are a bit tighter than between conditions

Correlation coefficients are larger within conditions than between conditions, but the differences are not large. Correlation coefficients are not a great metric. Media only samples are a little more similar between batch 2 and 3 than are the exosome dosed samples. Samples within the same condition and within the same batch are pretty similar to each other.

Back to the statistical testing...

The typical analysis goes something like this:

  • load the data into DGEList object
  • run TMM and normalize the data
  • check for experimental issues with cluster plot
  • estimate dispersion trends for moderated statistics
  • plot the Biological Coefficient of Variation (BCV)
  • do the pair-wise tests and get p-values, FDRs
  • check the p-value distribution
  • visualize DE results
    • MA plots
    • scatter plots
    • volcano plot
  • gather normalized data and test results into table
  • write out the final results table

We have already done the first 3 steps.

In [15]:
# we need to get dispersion estimates
y <- estimateDisp(y)
plotBCV(y, main = "Dispersion trends")
Design matrix not provided. Switch to the classic mode.

Dispersion depends on the protein intensity

There are commonly abundance-dependent dispersion effects in many measurement methods. Higher abundance signals have better signal-to-noise (or are the result of more averaging), so they can appear more precise than lower abundance signals. We see that the blue line increases as the protein intensity decreases. The statistical testing will use trended variances (estimated from multiple proteins) as a function of protein intensity, which spans several orders of magnitude.

If we had summarized the data using ratios, we would have collapsed the range on protein values into a concentrated blob centered at 1:1 ratios. We would lose all ability to have trended variance estimates. The trended variance is how the test statistics are moderated and is why limma and edgeR outperform t-tests in typical study designs where replicate numbers can be small. I do not know how many replicates in each condition are needed before the trended variance stops improving results (10, 20, 50?). I think most proteomics studies are still considerably short of replicate numbers where the t-test will work well. Avoiding ratios is very important in getting the best results.

edgeR exact test

We will use the exact test in edgeR for this simple two-state comparison. EdgeR has general linear model extensions that allow for more complicated experimental designs. They are a bit more work to set up and run. We will avoid that here, but the user's guide has many examples.

If you read up on edgeR or know a bit about statistics, you may wonder why we are using statistical testing built on a negative binomial distribution for counting experiments. The short answer is because it seems to work fine. The short longer story is that the variance is modeled by two terms: one for Poisson numbers and an over-dispersion term. As counts get larger (a few hundred or so), the Poisson term becomes small and the variance is handled by the over-dispersion term. The values of reporter ion peak heights, particularly when summed into protein totals, are large enough numbers. One could argue that limma would be a better choice. However, limma is a little more complicated to set up and run than edgeR.

Notebooks are a great framework for trying different analysis steps. In the previous version of this data set analysis, edgeR testing was compared to a two-sample t-test. Running different tests in R is not hard, what is hard is comparing different analysis methods and trying to define a metric to measure performance.

In [16]:
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))

    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# compute the exact test models, p-values, FC, etc.
et <- exactTest(y, pair = c("media", "exosome"))

# make the results table 
tt <- topTags(et, n = Inf, sort.by = "none")$table
med_exo <- collect_results(MQ_tmt_tmm, tt, M, "med", E, "exo")

Run test and reformat results

The exactTest function does the modeling for the specified pair. More than one pair can be present in DGEList objects to improve the dispersion estimates. The topTags function does the multiple testing corrections and returns more concise summaries of the testing results. The collect_results function combines some of the normalized data and test results into a sufficient results data frame.

Check if testing looks okay

It is important to see if the modeling looks reasonable. Our general assumptions are that we have a large fraction of the proteins that are not differentially expressed. Those will have a uniform (flat) p-value distribution from 0.0 to 1.0. We also expect (hopefully) some true differential expression candidates. Those should have very small p-values and have a sharper distribution at low p-values.

In [17]:
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + 
        ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# check the p-value distrubution
pvalue_plots(med_exo, 100, "Media vs Exosome-dosed")

We have the two distributions of p-values, so the testing seems reasonable. We can also see how the up-regulated protein number compares to the down-regulated number. We can use the topTags function to see which proteins have the smallest p-values.

In [18]:
# see how many up and down candidates (10% FDR)
summary(decideTests(et, p.value = 0.10))

# see which proteins have the smallest p-values
topTags(et)$table
       exosome-media
Down            1181
NotSig          2502
Up              1195
geneslogFClogCPMPValueFDR
2151sp|Q80V94|AP4E1_MOUSE 9.712092 6.351154 2.204329e-142 1.075272e-138
2518sp|Q6PDS3|SARM1_MOUSE 9.257728 5.905050 6.788691e-48 1.655762e-44
1057tr|E9Q6R7|E9Q6R7_MOUSE3.914449 8.044726 9.796961e-47 1.592986e-43
177sp|P54987|IRG1_MOUSE 2.578283 10.256279 5.549544e-45 6.767669e-42
2916tr|Q8R5I6|Q8R5I6_MOUSE8.735104 5.383087 5.384968e-44 5.253575e-41
480sp|P35173|CYT3_MOUSE 3.012063 9.226840 7.239595e-44 5.885791e-41
2670sp|Q9CRB6|TPPP3_MOUSE 9.078868 5.726764 9.860130e-44 6.871102e-41
3665sp|P04918|SAA3_MOUSE 3.462006 4.138469 1.271082e-42 7.750422e-40
2484sp|P33766|FPR1_MOUSE 3.225450 5.805522 4.588253e-42 2.486833e-39
1377sp|Q8BGW1|FTO_MOUSE 3.441914 7.394756 1.492332e-41 7.279594e-39

We can categorize candidates by ranges of adjusted p-values

In many discovery experiments, a Benjamini-Hochberg adjusted p-value (FDR) cutoff of 0.05 might be pretty strict. We use a 10% cutoff to distinguish DE from non-DE candidates. We define three cuts on the FDR: 10% to 5% are "low" significance, 5% to 1% are medium significance, and less than 1% are more "highly" significant. Cut values can be adjusted depending on the experimental situation.

We can look at expression ratio distributions as a function of candidate category. If variance is not too variable protein-to-protein, then we would expect larger mean differences to be associated with lower FDR values. Faceted plotting in ggplot2 is a nice method for showing such things.

In [19]:
# see how many candidates are in each category
med_exo %>% count(candidate)

log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# can look at log2FC distributions as a check
log2FC_plots(med_exo, 3, "LogFC by candidate for Media vs Exosome-dosed")
candidaten
high1490
med 553
low 333
no 2502

Visualize the edgeR results several ways

  • MA plots
  • scatter plots
  • volcano plot

We need some transformed axes for MA plots and for volcano plots. We will make a function for that and also some functions for the plotting. MA plots first. The dotted lines indicate 2-fold changes.

In [20]:
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

MA_plots <- function(results, x, y, title, make_facet = TRUE) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # make_facet - flag to plot facet views
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    if (make_facet == TRUE) {
        ma_facet <- ggplot(temp, aes(x = A, y = M)) +
            geom_point(aes(color = candidate, shape = candidate)) +
            scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
            scale_x_continuous("log10 Ave_intensity") +
            ma_lines +
            facet_wrap(~ candidate) +
            ggtitle(str_c(title, " (separated)"))
    }

    # make the plots visible
    print(ma)
    if (make_facet == TRUE) {
         print(ma_facet)
    }
}    

# MA plots of DE candidates
MA_plots(med_exo, "ave_med", "ave_exo", "Media versus exosome-dosed")

Scatter plots

The solid diagonal line is 1:1, the dotted lines are 2-fold changes. The axes are in log scale.

In [21]:
scatter_plots <- function(results, x, y, title, make_facet = TRUE) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # make_facet - flag to plot facet views
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    if (make_facet == TRUE) {
        scatter_facet <- ggplot(results, aes_string(x, y)) +
            geom_point(aes(color = candidate, shape = candidate)) +
            scatter_lines +
            facet_wrap(~ candidate) +
            ggtitle(str_c(title, " (separated)")) 
    }

    # make the plots visible
    print(scatter)
    if (make_facet == TRUE) {
         print(scatter_facet)
    }
}

scatter_plots(med_exo, "ave_med", "ave_exo", "Media versus exosome-dosed")

Volcano plot

Volcano plots are another common way to visualize DE candidates.

In [22]:
volcano_plot <- function(results, x, y, title, ymax) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
        # ymax - upper limit for y-axis
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        coord_cartesian(xlim = c(-5, 5), ylim = c(0, ymax)) + 
        ggtitle(str_c(title, " Volcano Plot"))
}

# finally, a volcano plot
volcano_plot(med_exo, "ave_med", "ave_exo", "Media versus exosome-dosed", 50)

See what the top DE candidates look like

We can see how the intensities of the individual samples compare for the top up- and down-regulated candidate proteins.

In [23]:
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", p-val: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title)
    }    
}
# plot the top 25 up and 20 down proteins
set_plot_dimensions(7, 4)
plot_top_tags(med_exo, 3, 4, 25)
set_plot_dimensions(7, 7)

Summary

EdgeR normalizations and testing seemed reasonable for this data.

  • non-DE candidate much less than 2-fold (most are closer to 1:1)
  • larger fold changes associated with more significant candidates
  • wider dispersion at lower intensities is incorporated into edgeR testing
    • need larger fold changes at lower intensities to get the same significance
  • proteins with larger fold-changes have much more significant p-values
    • we have a lot of dynamic range in the p-values
  • we did not do any isotopic purity corrections
    • true fold-changes could be larger than measured

There were a large number of significant candidates. There are options in edgeR to add a minimum fold-change to the testing. That can be used to reduce the number of candidates. Stricter FDR cutoffs could also be used. Ranking candidates for follow-up study is not trivial. Statistical values like fold-changes and p-values may not be sufficient to determine biologically relevant rankings. Biological information for this system should be an important consideration.

This would be a much better area to direct informatics students to work on than yet another search engine. There is a very uneven distribution of where talent is directed in these workflows. Label reagents, instrument methods, search engines, and PSM validation have gotten the most attention. Normalizations and statistical models have received some attention. Visualizing results and making biological sense of things are critical steps that have been neglected for too long. This is the critical bottleneck.

Comparison of MaxQuant and PAW

One mystery is what are the reporter ion quantities reported by MaxQuant? The smallest non-zero values are around 50 and in the PAW processing they are around 350. The data in the dilution series analysis has been processed with MaxQuant and PAW. The column total intensities reflect the expected dilution factors. Those values are highly correlated between MaxQuant and PAW (see next cell).

Correlation between MaxQuant and PAW reporter ion intensities

In [24]:
# column totals from the dilution series (25, 20, 15, 10, 5, 2.5) data
MQ <- c(252042061, 194468567, 147076787, 101203918, 49272033, 26071287)
PAW <- c(1996324081, 1562525113, 1207642244, 835353557, 400841992, 221591661)
dilution <- data.frame(PAW = PAW, MQ = MQ)

# from https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/
ggplotRegression <- function (fit) {
    require(ggplot2)
    ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
        geom_point(size = 3) +
        stat_smooth(method = "lm", col = "red") +
        labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5), 
                           "Intercept =",signif(fit$coef[[1]],5 ),
                           " Slope =",signif(fit$coef[[2]], 5),
                           " P =",signif(summary(fit)$coef[2,4], 5)))
}

ggplotRegression(lm(MQ ~ PAW, data = dilution))

More missing data in MaxQuant

There seem to be a few more missing data values on the low abundance end in MaxQuant compared to PAW. This could be due to overall lower sensitivity with MaxQuant compared to Comet/PAW (or many other search engines). The numbers of identified PSMs and proteins are in the table below.

Category Comet/PAW MaxQuant
MS2 Scans 272221 275695

PSMs (1% FDR)|85936|60766| Protein IDs|5462|5522| Proteins, 2 peptides|5462|4711|

PAW/Comet identified 41% more PSMs. Including single peptide per protein identifications pads the MaxQuant protein IDs and helps hide the poorer identification performance at the PSM level. I have a blog entry that talks more about MaxQuant performance here.

Top candidate overlap not so great for the top 25 up and top 25 down proteins

If people understood just how much testing p-values can change depending on data, model choice, and model parameter choices, maybe they would worry less about them. There were about 1200 up and 1200 down candidates at a 10% FDR cutoff. MaxQuant was a little under 1200 and PAW a little over 1200. The difference was not really too large. Seeing how well the top 25 compare based on p-values is not really a fair test. The full lists would need to be examined. For the many proteins that were the same in the top 25 plots, the expression patterns looked qualitatively similar. Things are probably more similar than different.

An important point is that these lists are a bit dynamic at the single protein level. Drawing conclusions from single proteins might be a perilous exercise. Most biological process and overall system level changes should involve many proteins. Hypotheses should be based on the behavior of multiple proteins. Also of real importance, is that the sets of proteins that drive the biological conclusions may not all be DE candidates at some arbitrary significance level. It is critical to have full results tables with statistical test results, normalized data, expression changes, and annotations for ALL quantifiable proteins. The tools to do the complex queries of these complex data sets for complex biological questions are desperately needed.

Getting results out of R

One step towards more meaningful results summaries are getting the main proteomics results and the statistical results combined into a single table. If we are careful with how we bring the data into R and how we take it back out, we can keep tables in order so merging results are easier. There are several strategies that can work well. If the data coming into R and the data coming out of R are not easy to align correctly, then edit the notebooks to make it work right. More data prep can occur before importing into R, for example. We kept the accessions column as we worked with the data. We can be really robust and use merge functions to weave together starting tables with final results tables.

We should always end notebooks with information about what packages and versions were used in the analysis.

In [25]:
# save the testing results
write.table(med_exo, file = "KUR1502_MQ-results.txt", sep = "\t",
            row.names = FALSE, na = " ")

# log the session
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_3.24.3    limma_3.38.3    scales_1.0.0    gridExtra_2.3  
 [5] psych_1.8.12    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.0.1  
 [9] purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.1   
[13] ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     locfit_1.5-9.1   tidyselect_0.2.5 repr_0.19.2     
 [5] splines_3.5.3    haven_2.1.0      lattice_0.20-38  colorspace_1.4-1
 [9] generics_0.0.2   htmltools_0.3.6  base64enc_0.1-3  rlang_0.3.3     
[13] pillar_1.3.1     foreign_0.8-71   glue_1.3.1       withr_2.1.2     
[17] modelr_0.1.4     readxl_1.3.1     uuid_0.1-2       plyr_1.8.4      
[21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.2     
[25] evaluate_0.13    labeling_0.3     parallel_3.5.3   broom_0.5.1     
[29] IRdisplay_0.7.0  Rcpp_1.0.1       backports_1.1.3  IRkernel_0.8.15 
[33] jsonlite_1.6     mnormt_1.5-5     hms_0.4.2        digest_0.6.18   
[37] stringi_1.4.3    grid_3.5.3       cli_1.1.0        tools_3.5.3     
[41] magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.2 
[45] xml2_1.2.0       lubridate_1.7.4  assertthat_0.2.1 httr_1.4.0      
[49] rstudioapi_0.10  R6_2.4.0         nlme_3.1-137     compiler_3.5.3  

One last thing...

Let's see how a basic t-test works with the MQ data

The best statistical option in Perseus is probably a two-sample t-test with a multiple testing correction. Let's see how that compares to the edgeR analysis. We will use the shared variance option on the t-test to get a little more variance pooling.

In [26]:
# do the t-test on log transformed intensities to be safe
ttest_MQ <- log2(MQ_tmt_tmm)
# add average ratio columns (non-logged ratios), fold-change column, and row names
ttest_MQ$ave_med <- rowMeans(MQ_tmt_tmm[1:3])
ttest_MQ$ave_exo  <- rowMeans(MQ_tmt_tmm[4:7])
ttest_MQ$logFC <- log2(ttest_MQ$ave_exo / ttest_MQ$ave_med)
row.names(ttest_MQ) <- accession

# apply the basic two-sample t-test (we will pool variance)
t.result <- apply(ttest_MQ, 1, function(x) t.test(x[1:3], x[4:7], var.equal = TRUE))
# extract the p-value column from the t-test thingy 
ttest_MQ$PValue <- unlist(lapply(t.result, function(x) x$p.value))
# do a Benjamini-Hochberg multiple testing correction
ttest_MQ$FDR <- p.adjust(ttest_MQ$PValue, method = "BH")

# add a DE candidate status column
ttest_MQ$candidate <- cut(ttest_MQ$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                          labels = c("high", "med", "low", "no"))
    
# count up, down and the rest (FDR less than 0.05)
all <- dim(ttest_MQ)[1]
up <- dim(ttest_MQ[(ttest_MQ$FDR <= 0.10) & (ttest_MQ$logFC > 0.0), ])[1]
down <- dim(ttest_MQ[(ttest_MQ$FDR <= 0.10) & (ttest_MQ$logFC <= 0.0), ])[1]
print("This is like decideTest in edgeR - 10% FDR cut:")
up 
all - up - down
down
print("Candidate Counts:")
summary(ttest_MQ$candidate)
    
# what does the test p-value distribution look like?
ggplot(ttest_MQ, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(ttest_MQ$PValue, breaks = 100, plot = FALSE)$counts[26:100])) +
  ggtitle("MQ data with t-test p-value distribution")
[1] "This is like decideTest in edgeR - 10% FDR cut:"
1212
2407
1259
[1] "Candidate Counts:"
high
665
med
1254
low
552
no
2407

Candidate numbers and p-value distribution look okay

The total up and down candidate numbers are around 1200 like we had with edgeR. The number of candidates in the different FDR classes is different. The t-test has most candidates in the medium category in contrast to edgeR where most of the candidates were in the high class. The p-value distribution is qualitatively similar. These bird's eye views seem reasonable.

We can look at the more detailed plots (MA plots, scatter plots, and volcano plot) to see if a deeper look into the DE candidates still seems okay. We will repeat the edgeR plots here for comparison.

In [27]:
MA_plots(med_exo, "ave_med", "ave_exo", "MQ edgeR results", FALSE)
MA_plots(ttest_MQ, "ave_med", "ave_exo", "MQ t-test results")
In [28]:
scatter_plots(med_exo, "ave_med", "ave_exo", "MQ edgeR results", FALSE)
scatter_plots(ttest_MQ, "ave_med", "ave_exo", "MQ t-test results")
In [29]:
# compare volcano plots
volcano_plot(med_exo, "ave_med", "ave_exo", "EdgeR results", 4)
volcano_plot(ttest_MQ, "ave_med", "ave_exo", "t-test results", 4)
In [ ]:

Plain t-test is much worse than edgeR

Every aspect of the t-test looks pretty bad

  • too many protein with small fold-changes are significant
  • less correlation between low p-value and large fold change
  • less useful candidate categorization
  • statistically significant candidates are a little biased towards higher intensities
  • ranking the top candidates for follow up work looks more difficult to do